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Abstract—A key step in program optimization is the estimation 
of optimal values for parameters such as tile sizes and loop 
unrolling factors. Traditional compilers use simple analytical 
models to compute these values. In contrast, library generators 
like ATLAS use global search over the space of parameter values 
by generating programs with many different combinations of 
parameter values, and running them on the actual hardware 
to determine which values give the best performance. It is 
widely believed that traditional model-driven optimization can- 
not compete with search-based empirical optimization because 
tractable analytical models cannot capture all the complexities 
of modern high-performance architectures, but few quantitative 
comparisons have been done to date. 

To make such a comparison, we replaced the global search 
engine in ATLAS with a model-driven optimization engine, 
and measured the relative performance of the code produced 
by the two systems on a variety of architectures. Since both 
systems use the same code generator, any differences in the 
performance of the code produced by the two systems can 
come only from differences in optimization parameter values. 
Our experiments show that model-driven optimization can be 
surprisingly effective, and can generate code with performance 
comparable to that of code generated by ATLAS using global 
search. 


Index Terms— program optimization, empirical optimization, 
model-driven optimization, compilers, library generators, BLAS, 
high-performance computing 


I. INTRODUCTION 


The sciences do not try to explain, they hardly 
even try to interpret, they mainly make models. By 
a model is meant a mathematical construct which, 
with the addition of certain verbal interpretations, 
describes observed phenomena. The justification of 
such a mathematical construct is solely and precisely 
that it is expected to work. 


John Von Neumann 


It is a fact universally recognized that current restructuring 
compilers do not generate code that can compete with hand- 
tuned code in efficiency, even for a simple kernel like matrix 
multiplication. This inadequacy of current compilers does 
not stem from a lack of technology for transforming high- 
level programs into programs that run efficiently on modern 
high-performance architectures; over the years, the compiler 
community has invented innumerable techniques such as linear 
loop transformations [5,11, 14,29,42], loop tiling [27,28, 
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43] and loop unrolling [4,32] for enhancing locality and 
parallelism. Other work has focused on algorithms for esti- 
mating optimal values for parameters associated with these 
transformations, such as tile sizes [7, 13,36] and loop unroll 
factors [4]. Nevertheless, performance-conscious programmers 
must still optimize their programs manually [15, 19]. 

The simplest manual approach to tuning a program for a 
given platform is to write different versions of that program, 
evaluate the performance of these versions on the target plat- 
form, and select the one that gives the best performance. These 
different versions usually implement the same algorithm, but 
differ in the values they use for parameters such as tile 
sizes and loop unroll factors. The architectural insights and 
domain knowledge of the programmer are used to limit the 
number of versions that are evaluated. In effect, the analytical 
techniques used in current compilers to derive optimal values 
for such parameters are replaced by an empirical search over 
a suitably restricted space of parameter values (by empirical 
search, we mean a three step process: (1) generating a version 
of the program corresponding to each combination of the 
parameters under consideration, (2) executing each version on 
the target machine and measuring its performance, and (3) 
selecting the version that performs best). This approach has 
been advocated most forcefully by Fred Gustavson and his co- 
workers at IBM, who have used it for many years to generate 
the highly optimized ESSL and PESSL libraries for IBM ma- 
chines [34]. Recently, a number of projects such as FFTW [17, 
18], PhiPAC [2,6], ATLAS [1,41], and SPIRAL [26, 33] have 
automated the generation of the different program versions 
whose performance must be evaluated. Experience shows that 
these library generators produce much better code than native 
compilers do on modern high-performance architectures. 

Our work was motivated by a desire to understand the 
performance gap between the BLAS codes produced by AT- 
LAS and by restructuring compilers, with the ultimate goal 
of improving the state of the art of current compilers. One 
reason why compilers might be at a disadvantage is that they 
are general-purpose and must be able to optimize any program, 
whereas a library generator like ATLAS can focus on a partic- 
ular problem domain. However, this is somewhat implausible 
because dense numerical linear algebra, the particular problem 
domain of ATLAS, is precisely the area that has been studied 
most intensely by the compiler community, and there is an 
extensive collection of well-understood transformations for 
optimizing dense linear algebra programs. Another reason for 
the inadequacy of current compilers might be that new trans- 
formations, unknown to the compiler community, are required 
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Fig. 1. Architecture of ATLAS and of Model-driven ATLAS 


to produce code of the same quality as the code produced by 
ATLAS. Finally, it is possible that the analytical models used 
by compilers to estimate optimal values for transformation 
parameters are overly simplistic, given the complex hardware 
of modern computers, so they are not able to produce good 
values for program optimization parameters. 

No definitive studies exist to settle these matters. Our 
research is the first quantitative study of these issues. 

Figure 1 shows our experimental set-up, which makes use 
of the original ATLAS system (top of the figure) and a 
modified version (bottom of the figure) that uses analytical 
models instead of empirical search. Like any system that uses 
empirical search, ATLAS has (i) a module that controls the 
search, which is used to determine optimal values for code 
optimization parameters (mmsearch), and (ii) a module that 
generates code, given these values (mmcase). The parameters 
used by ATLAS are described in more detail in Section II; 
for example, Np is the tile size to be used when optimizing 
code for the L1 data cache. In general, there is an unbounded 
number of possible values for a parameter like Ng so it 
is necessary to bound the size of the search space. When 
ATLAS is installed, it first runs a set of micro-benchmarks 
to determine hardware parameters such as the capacity of the 
L1 data cache and the number of registers. These hardware 
parameters are used to bound the search space. The mmsearch 
module enumerates points within this bounded search space, 
invokes the mmcase module to generate the appropriate code 
(denoted by mini-MMM in the figure), runs this code on the 
actual machine, and records its execution time. At the end of 
the search, the parameter values that gave the best performance 
are used to generate the library code. This library is coded in a 
simple subset of C, which can be viewed as portable assembly 
code, and it is compiled to produce the final executable. 

We first studied the code generation module!, and deter- 
mined that the code it produces can be viewed as the end 


'The description of ATLAS in this paper was arrived at by studying the 
ATLAS source code. In case of any discrepancy between this description and 
how the ATLAS system is actually implemented, the documentation of the 
ATLAS project should be considered to be authoritative [39-41]. 
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result of applying standard compiler transformations to high- 
level BLAS codes. As we describe in Section I, the code 
produced by ATLAS is similar to what we would get if we 
applied cache tiling, register tiling, and operation scheduling 
to the standard three-loop matrix multiplication code. This 
exercise ruled out the possibility that ATLAS incorporated 
some transformation, unknown to the compiler community, 
that was critical for obtaining good performance. We then 
modified ATLAS by replacing the search module, described 
in more detail in Section III, with a module (nmmmodel) that 
uses standard analytical models to estimate optimal values 
for the optimization parameters, as described in Section IV. 
Since both ATLAS and the modified ATLAS use the same 
code generator, we are assured that any difference in the 
performance of the generated code results solely from different 
choices for optimization parameter values. In Section V, 
we present experimental results on ten different platforms, 
comparing 


e the time spent to determine the parameter values, 
e the values of the parameters, and 
e the relative performance of generated code. 


Our results show that on all ten platforms, a relatively 
simple and very intuitive model is able to estimate near- 
optimal values for the optimization parameters used by the 
ATLAS Code Generator. We conclude in Section VI with a 
discussion of our main findings, and suggest future directions 
for research. 

One feature of ATLAS is that it can make use of hand- 
tuned BLAS routines, many of which are included in the 
ATLAS distribution. When ATLAS is installed on a machine, 
these hand-coded routines are executed and evaluated. If the 
performance of one of these hand-coded routines surpasses 
the performance of the code generated by the ATLAS Code 
Generator, the hand-coded routine is used to produce the 
library. For example, neither the ATLAS Code Generator nor 
the C compilers on the Pentium IV exploit the SSE2 vector 
extensions to the x86 instruction set, so ATLAS-generated 
matrix multiplication code on the Pentium IV runs at around 


1.5 GFLOPS. However, the matrix multiplication routine in 
the library produced by ATLAS runs at 3.3 GFLOPS because 
it uses carefully hand-coded kernels, contributed by expert 
programmers and part of the ATLAS distribution, which use 
these vector extensions. 

Our concern in this paper is not with handwritten code, but 
with the code produced by the ATLAS Code Generator and 
with the estimation of optimal values for the parameters that 
are inputs to the code generator. To make clear distinctions, 
we use the following terminology in the rest of this paper. 


e ATLAS CGw/S: This refers to the ATLAS system in 
which all code is produced by the ATLAS Code Generator 
with Search to determine parameter values. No hand- 
written, contributed code is allowed. 

e ATLAS Model: This refers to the modified ATLAS system 
we built in which all code is produced by the ATLAS 
Code Generator, using parameter values produced from 
analytical models. 

e ATLAS Unleashed: This refers to the complete ATLAS 
distribution which may use hand-written codes and prede- 
fined parameter values (architectural defaults) to produce 
the library. Where appropriate, we include, for complete- 
ness, the performance graphs for the libraries produced 
by ATLAS Unleashed. 


II. ATLAS CODE GENERATOR 


In this section, we use the framework of restructuring 
compilers to describe the structure of the code generated by 
the ATLAS Code Generator. While reading this description, it 
is important to keep in mind that ATLAS is not a compiler. 
Nevertheless, thinking in these terms helps clarify the signifi- 
cance of the code optimization parameters used in ATLAS. 

We concentrate on matrix-matrix multiplication (MMM), 
which is the key routine in the BLAS. Naive MMM code 
is shown in Figure 2. In this, and all later codes, we use the 
MATLAB notation [First : Step : Last] to represent the set 
of all integers between First and Last in steps of Step. 


for ie f[0:1:N-— 1] 
for je f0:1:M- 1] 
for kef0:1:K-1] 
Gij = Ci; + Aik X Braj 


Fig. 2. Naive MMM Code 


A. Memory Hierarchy Optimizations 


The code shown in Figure 2 can be optimized for locality 
by blocking for the L1 data cache and registers. Blocking 
is an algorithmic transformation that converts the matrix 
multiplication into a sequence of small matrix multiplications, 
each of which multiplies small blocks of the original matrices. 
Blocking matrix multiplication for memory hierarchies was 
discussed by McKellar and Coffman as early as 1969 [31]. The 
effect of blocking can be accomplished by a loop transforma- 
tion called tiling, which was introduced by Wolfe in 1987 [43]. 


e Optimization for the LI data cache: 


ATLAS implements an MMM as a sequence of mini- 
MMMs, where each mini-MMM multiplies sub-matrices 
of size Ng x Ng. Np is an optimization parameter whose 
value must be chosen so that the working set of the mini- 
MMM fits in the cache. 

In the terminology of restructuring compilers, the triply- 
nested loop of Figure 2 is tiled with tiles of size Ng x 
Np x Np, producing an outer and an inner loop nest. 
For the outer loop nest, code for both the JIK and IJK 
loop orders are implemented. When the MMM library 
routine is called, it uses the shapes of the input arrays to 
decide which version to invoke, as described later in this 
section. For the inner loop nest, only the JIK loop order 
is used, with (j’,2’,k’) as control variables. This inner 
loop nest multiplies sub-matrices of size Ng x Np, and 
we call this computation a mini-MMM. 

Optimization for the register file: ATLAS represents each 
mini-MMM into a sequence of micro-MMMs, where each 
micro-MMM multiplies an My x 1 sub-matrix of A by 
a 1 x Ny sub-matrix of B and accumulates the result 
into an My x Ny sub-matrix of C. My and Ny are 
optimization parameters that must be chosen so that 
a micro-MMM can be executed without floating-point 
register spills. For this to happen, it is necessary that 
My + Ny + Mu x Nu < Npr, where Np is the number 
of floating-point registers. 

In terms of restructuring compiler terminology, the 
(j', 2’, k’) loops of the mini-MMM from the previous step 
are tiled with tiles of size Ny x My x Ku, producing 
an extra (inner) loop nest. The JIK loop order is chosen 
for the outer loop nest after tiling, and the KJI loop order 
for the loop nest of the mini-MMM after tiling. 

The resulting code after the two tiling steps is shown in 
Figure 3. To keep this code simple, we have assumed 
that all step sizes in these loops divide the appropriate 
loop bounds exactly (so Ng divides M, N, and K, 
etc.). In reality, code should also be generated to handle 
the fractional tiles at the boundaries of the three arrays; 
we omit this clean-up code to avoid complicating the 
description. The strategy used by ATLAS to copy blocks 
of the arrays into contiguous storage is discussed later in 
this section. Figure 4 is a pictorial view of a mini- MMM 
computation within which a micro-MMM is shown using 
shaded rectangles. In this figure, the values assigned to 
variable K are produced by executing the two for loops 
in Figure 3 corresponding to indices k’ and k”. 


To perform register allocation for the array variables ref- 
erenced in the micro-MMM code, ATLAS uses techniques 
similar to those presented in [8]: the micro-MMM loop nest 
(j”, i”) in Figure 3 is fully unrolled, producing My x Ny mul- 
tiply and add statements in the body of the middle loop nest. In 
the unrolled loop body, each array element is accessed several 
times. To enable register allocation of these array elements, 
ATLAS uses scalar replacement [9] to introduce a scalar 
temporary for each element of A, B, and C that is referenced in 
the unrolled micro-MMM code, and replaces array references 
in the unrolled micro-MMM code with references to these 


// MMM loop nest (j,i, k) 
// copy full A here 
for j € [1: Ne: M] 
// copy a panel of B here 
for i€ [|1:Nsg:N] 
// possibly copy a tile of C here 
for kef[l:Nsg:K] 
// mini-MMM loop nest (j',i', k’) 
for j e€[j:Nu:j+Neg-—1] 
for i' €[i: Mu:i+Ns-1] 
for k €[|k:Ku:k+Neg-1] 
for k” e€[k’':1:k'+Ku-]] 
// micro-MMM loop nest (j”,i”) 
for j” E€[7':1:7/+Nu-] 
for iei :1:%+Mu—-J 
Cyn gr = Cyn grr + Ayn x Bergr 


Fig. 3. MMM tiled for L1 data cache and Registers 
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scalars. Appropriate assignment statements are introduced to 
initialize the scalars corresponding to A and B elements. In 
addition, assignment statements are introduced before and 
after the k’ loop to initialize the scalars corresponding to 
C elements, and to write the values back into the array 
respectively. It is expected that the back-end compiler will 
allocate floating-point registers for these scalars. 


B. Pipeline scheduling 


The resulting straight-line code in the body of the k” loop 
is scheduled to exploit instruction-level parallelism. Note that 
the operations in the k” loop are the My + Ny loads of A and 
B elements required for the micro-MMM, and the correspond- 
ing Mu x Ny multiplications and additions. On hardware 
architectures that have a fused multiply-add instruction, the 
scheduling problem is much simpler because multiplies and 
adds are executed together. Therefore, we only discuss the 
more interesting case when a multiply-add instruction is not 
present. An optimization parameter FMA tells the code 
generator whether to assume that a fused multiply-add exists. 
The scheduling of operations can be described as follows. 


e Construct two sequences of length (My x Ny), one con- 
taining the multiply operations (we will denote them by 
mul, mulz, ..., MUlMyxNy) and the other containing 
the add operations (we will denote them by addı, addz2, 

saa addMy xNy). 

e Interleave the two sequences as shown below to create a 
single sequence that is obtained by skewing the adds by 
a factor of Ls, where Ls is an optimization parameter. 
Intuitively, this interleaving separates most dependent 
multiplies and adds by 2 x L¿— 1 independent instructions 
to avoid stalling the processor pipeline. 


muly 
mulz 


mulz, 
addı 
mult +1 
addə 
mult, +2 


mulMyxNu-—1 
addMyxNu-Ls 
mul My x Nu 
addMyxNy-L.+1 
add My x Ny —L.+2 


addy x Ny 


e Inject the My + Ny loads of the elements of A and 
B into the resulting sequence of arithmetic operations 
by scheduling a block of Ip (Initial Fetch) loads in the 
beginning and blocks of Npr loads thereafter as needed. 
Ip and Np are optimization parameters. 

e Unroll the k” loop completely. The parameter Ky must 
be chosen to be large enough to reduce loop overhead, 
but not so large that the body of the k’ loop overflows 
the L1 instruction cache. 

e Reorganize the k’ loop to enable the target machine 
to overlap the loads from one iteration with arithmetic 
operations from previous iterations. Techniques for ac- 
complishing this are known as software pipelining or 
modulo scheduling [35]. 

Note that skewing of dependent adds and multiplies in- 
creases register pressure; in particular, the following inequality 
must hold to avoid register spills (that is, saving in memory 
the value stored in a processor register): 


My x Nu + Mu + Nu + Ls < NR (1) 


C. Additional details 
There are several details we have not discussed so far. 


e ATLAS considers a primitive form of L2 cache tiling, 
driven by a parameter called CacheEdge. ATLAS em- 
pirically finds the best value of CacheEdge and uses it 
to compute Kp, based on Inequality 2. 


2x Kp x Ng + N32 < CacheEdge (2) 


Kp is further trimmed to be a multiple of Ng. The 
computed value of Kp is used to block the K dimension 
of the original problem for one additional level of the 
memory hierarchy. We will not discuss CiacheEdge and 
Kp in further detail as they are outside the scope of the 
paper. 

ATLAS chooses the outermost loop order (shown as JIK 
in Figure 3) during runtime. This technique is known as 
versioning, because it requires both versions of the code 
to be compiled in the library. 

The decision of which loop order to choose is based on 
the size of matrices A and B. If A is smaller than B (N < 
M), ATLAS chooses the JIK loop order. This guarantees 
that if A fits completely in L2 or higher cache level, it 
is reused successfully by the loop nest. Similarly, if B is 
the smaller matrix (M < N), ATLAS chooses the IJK 
loop order. 

For brevity, we consider only the JIK loop order in the 
rest of the paper. 

Unless the matrices are too small or too large, ATLAS 
copies tiles of matrices A, B and C to sequential mem- 
ory to reduce the number of conflict misses and TLB 
misses during the execution of a mini- MMM. Copying 
is performed in a manner that allows the copied tiles to 
be reused by different mini- MMMs. The comments in 
Figure 3 and the discussion below explain how this goal 
is achieved for the JIK loop order. 


— Copy all tiles of A before the beginning of the 
outermost 7 loop. This is necessary as these tiles 
are fully reused in each iteration of the 7 loop. 

— Copy all tiles from the jt” vertical panel of B before 
the beginning of the ¿ loop. This is necessary as this 
panel is fully reused by each iteration of the 7 loop. 

- The single (i,j) tile of C is copied before the 
beginning of the k loop if Re > 12. This may 
reduce TLB misses which may be beneficial since 
this tile is reused by each iteration of the k loop, 
provided that the cost of copying the tile of C to a 
temporary buffer and back, can be amortized by the 
computation (large enough Kp). 


If the matrices are very small or if there is insufficient 
memory for copying tiles, the cost of copying might out- 
weigh the benefits of reducing conflict misses during the 
computation. Therefore, ATLAS generates non-copying 
versions of mini-MMM as well, and decides at runtime 
which version to use. Without copying, the number of 
conflict misses and TLB misses may rise, so it makes 
sense to use a smaller tile size for the non-copying mini- 
MMM. In ATLAS, this tile size is another optimization 
parameter called NC’Ng (non-copying Ng). Roughly 
speaking, the non-copy version is used if (i) the amount 
of computation is less than some threshold (M x N x K 
in Figure 2 is less than some threshold), and (ii) at least 
one dimension of one of the three matrices is smaller than 
3x NC Np. The non-copy version is used also when there 
is insufficient memory to perform the copying. 


D. Discussion 


Table I lists the optimization parameters for future reference. 


LI data cache tile size 
L1 data cache tile size for non-copying version 
Register tile size 


Noe poe o 
Ng 


Unroll factor for k’ loop 

Latency for computation scheduling 

1 if fused multiply-add available, 0 otherwise 
Scheduling of loads 


TABLE I 
SUMMARY OF OPTIMIZATION PARAMETERS 


It is intuitively obvious that the performance of the gener- 
ated mini-MMM code suffers if the values of the optimization 
parameters in Table I are too small or too large. For example, if 
My and Ny are too small, the My x Nu block of computation 
instructions might not be large enough to hide the latency of 
the My + Nu loads. On the other hand, if these parameters 
are too large, register spills happen. Similarly, if the value of 
Ky is too small, there is more loop overhead, but if this value 
is too big, the code in the body of the k’ loop will overflow 
the instruction cache. The goal now is to determine optimal 
values of these parameters for obtaining the best mini- MMM 
code. 


III. EMPIRICAL OPTIMIZATION IN ATLAS 


ATLAS performs a global search to determine optimal 
values for the optimization parameters listed in Table I. In 
principle, the search space is unbounded because most of the 
parameters, such as Np, are integers. Therefore, it is necessary 
to bound the search space, using parameters of the machine 
hardware; for example, My and Ny, the dimensions of the 
register tile, must be less than the number of registers. 

Since ATLAS is self-tuning, it does not require the user to 
provide the values of such machine parameters; instead, it runs 
simple micro-benchmarks to determine approximate values for 
these parameters. It then performs a global search, using the 
machine parameter values to bound the search space. 


A. Estimating machine parameters 


The machine parameters measured by ATLAS are the 
following. 


e Cı: the size of L1 data cache. 

e Np: the number of floating-point registers. 

e FMA: the availability of a fused multiply-add instruc- 
tion. 

Ls: although this is not a hardware parameter per se, 
it is directly related to the latency of floating point 
multiplication, as explained in Section II-B. ATLAS 
measures this optimization parameter directly using a 
micro-benchmark. 


The micro-benchmarks used to measure machine parameters 
are independent of matrix multiplication. For example, the 
micro-benchmark for estimating Cı is similar to the one 
discussed in Hennessy and Patterson [23]. 


Two other machine parameters are critical for performance: 
(i) the L1 instruction cache size, and (ii) the number of 
outstanding loads that the hardware supports. ATLAS does not 
determine these explicitly using micro-benchmarks; instead, 
they are considered implicitly during the optimization of 
matrix multiplication code. For example, the size of the L1 
instruction cache limits the Ky parameter in Figure 3. Rather 
than estimate the size of the instruction cache directly by 
running a micro-benchmark and using that to determine the 
amount of unrolling, ATLAS generates a suite of mini- MMM 
kernels with different Ky values, and selects the kernel that 
achieves best performance. 


B. Global search for optimization parameter values 


To find optimal values for the optimization parameters in 
Table I, ATLAS uses orthogonal line search, which finds 
an approximation to the optimal value of a function y = 
f(@1,22,...,2n), an n-dimensional optimization problem, by 
solving a sequence of n 1-dimensional optimization problems 
corresponding to each of the n parameters. When optimizing 
the value of parameter x;, it uses reference values for para- 
meters (741, Li+2,.- -, Zn) that have not yet been optimized. 
Orthogonal line search is heuristic because it does not neces- 
sarily find the optimal value even for a convex function, but 
with luck, it might come close. 

To specify an orthogonal line search, it is necessary to 
specify (i) the order in which the parameters are optimized, (ii) 
the set of possible values considered during the optimization of 
each parameter, and (iii) the reference value used for parameter 
k during the optimization of parameters 1, 2,..., k — 1. 

The optimization sequence used in ATLAS is the following. 


1) Find best Np. 

2) Find best My and Ny. 

3) Find best Ky. 

4) Find best Ls. 

5) Find best Fp, Ip, and Np. 

6) Find best NCNpB: a non-copy version of Nz. 
7) Find best clean-up codes. 


We now discuss each of these steps in greater detail. 

1) Find best Ng: In this step, ATLAS generates a number 
of mini-MMMs for matrix sizes Ng x Np where Np is a 
multiple of 4 that satisfies the following inequality: 


16 < Ng < min (30, VG) (3) 


The reference values of My and Ny are set to the values 
closest to each other that satisfy (1). For each matrix size, 
ATLAS tries two extreme cases for Ky — no unrolling (Ky = 
1) and full unrolling (Ky = Np). 

The Ng value that produces highest MFLOPS is chosen 
as “best Np” value, and it is used from this point on in all 
experiments as well as in the final versions of the optimized 
mini-MMM code. 

2) Find best My and Ny: This step is a straightforward 
search that refines the reference values of My and Ny that 
were used to find the best Ng. ATLAS tries all possible 
combinations of My and Ny that satisfy inequality (1). Cases 


when My or Ny is 1 are treated specially. A test is performed 
to see if 1 x 9 unrolling or 9 x 1 unrolling is better than 3 x 3 
unrolling. If not, unrolling factors of the form 1 x U and U x 1 
for values of U greater than 3 are not checked. 

3) Find best Ky: This step is another simple search. Unlike 
My and Ny, Ky does not depend on the number of available 
registers, so it can be made as large as desired without causing 
register spills. The main constraint is instruction cache size. 
ATLAS tries values for Ky between 4 and Ne as well 
as the special values 1 and Ng. The value that gives best 
performance (based on Ng, My and Ny as determined from 
the previous steps) is declared the optimal value for Ky. 

4) Find best Ls: In this step, ATLAS uses L, values in 
the interval [1,6] to schedule the computations in the micro- 
MMM of Figure 3 to determine the best choice for Ls. It 
also ensures that the chosen value divides My x Ny x Ky to 
facilitate instruction scheduling. 

5) Find best Fp, Ip, and Np: In this step, ATLAS searches 
for the values of Fp, Ip and Np. First, ATLAS determines 
the value of Fp (0 or 1). Then, it searches for the best value 
of the pair (IF, Nr) where Ip is in the interval [2,My+Ny] 
and Nr is in the interval [1,My+Ny-Ir]. 

6) Find best NC’ Ng: For the non-copying version of mini- 
MMM, ATLAS uses the same values of My, Nu, Ff, Ir, and 
Nr that it uses for the copying version. Without copying, the 
likelihood of conflict misses is higher, so it makes sense to 
use a smaller L1 cache tile size than in the version of mini- 
MMM that performs copying. ATLAS searches for an optimal 
value of NC NBs in the range [Ng : —4 : 4]. We would expect 
performance to increase initially as the tile size is decreased, 
but decrease when the tile size becomes too small. ATLAS 
terminates the search when the performance falls by 20% or 
more from the best performance it finds during this search. 
Finally, some restricted searches for better values of Ky and 
L, are done. 

7) Find best clean-up codes: If the tile size is not a multiple 
of the original matrix size, there may be left-over rows and 
columns, at the boundaries of the matrices, forming fractional 
tiles. To handle these fractional tiles, ATLAS generates clean- 
up code — a special mini- MMM in which one or more of the 
dimensions of the three tiles is smaller than Ng. For M and 
N clean-up only the corresponding dimension is smaller than 
Np, while for K cleanup, any of the three dimensions can be 
smaller than Np. 

For example, ATLAS generates K clean-up codes as fol- 
lows. For each value of L, representing the size of the K 
dimension, starting with L = Ng — 1 and going down, it 
generates a specialized version of the mini-MMM code in 
which some of the loops are fully unrolled. Full unrolling 
is possible because the shapes of the operands are completely 
known. When the performance of the general version falls 
within 1% of the performance of the current specialized 
version, the generation process is terminated. The current L is 
declared to be the Crossover Point. At runtime, the specialized 
versions are invoked when the dimension of the left-over tile 
is greater than L, while the general version is invoked for tile 
sizes smaller than L. 

For M and N clean-up ATLAS produces only a general 


version, as these are outer loops in the outermost loop nest 
in Figure 3 and they are not as crucial to performance as K 
clean-up is. The use of clean-up code in ATLAS is discussed 
in more detail in [39]. 


C. Discussion 


In optimization problems, there is usually a trade-off be- 
tween search time and the quality of the solution. For example, 
we can refine the parameters found by ATLAS by repeating the 
orthogonal line search some number of times, using the values 
determined by one search as the reference values for the next 
search. It is also possible to use more powerful global search 
algorithms like simulated annealing. However, the potential for 
obtaining better solutions must be weighed carefully against 
the increase in installation time. We will address this point in 
the conclusions. 


IV. MODEL-BASED OPTIMIZATION 


In this section, we present analytical models for estimat- 
ing optimal values for the parameters in Table I. To avoid 
overwhelming the reader, we first present models that ignore 
interactions between different levels of the memory hierarchy 
(in this case, L1 data cache and registers). Then, we refine the 
models to correct for such interactions. 


A. Estimating hardware Parameters 


Model-based optimization requires more machine parame- 
ters than the ATLAS approach because there is no search. The 
hardware parameters required by our model are as follows. 


e Cı, Bı: the capacity and the line size of the L1 data 
cache. 

e Cy: The capacity of the L1 instruction cache. 

e L,: hardware latency of the floating-point multiply in- 
struction 

e |ALUrp|: number of floating-point functional units 

e Np: the number of floating-point registers. 

e FMA: the availability of a fused multiply-add instruc- 
tion. 


Empirical optimizers use the values of machine parameters 
only to bound the search space, so approximate values for 
these parameters are adequate. In contrast, analytical models 
require accurate values for these parameters. Therefore, we 
have developed a tool called X-Ray [44], which accurately 
measures these values. 


B. Estimating Np 


We present our model for estimating Ng using a sequence 
of refinements for increasingly complex cache organizations. 
We start with the mini-MMM code in Figure 5, and then adjust 
the model to take register tiling into account. 

The goal is to find the value of Ng that optimizes the use 
of the L1 data cache. First, we consider a simple cache of 
capacity C1, which is fully-associative with optimal replace- 
ment policy and unit line-size. There are no conflict misses, 
and spatial locality is not important. 


for j'€[0:1:Ns-—1] 
for # €[0:1:Ne-]] 
for k’€[0:1:Ne-] 
Ci 51 = Cir 57 + Ayres x Burj 


Fig. 5. Schematic Pseudo-Code for mini-MMM 


The working set in memory of the mini-MMM loop nest in 
Figure 5 consists of three Ng x Np tiles, one from each of 
the matrices A, B, and C. For the rest of this section, we will 
refer to these tiles just as A, B, and C. This working set fits 
entirely in the cache if Inequality (4) holds. 


3N < Ci (4) 


A more careful analysis shows that it is not actually neces- 
sary for all three Ng x Np blocks to reside in the cache for 
the entire duration of the mini-MMM computation. Consider 
the mini-MMM code shown in Figure 5. Because k’ is the 
innermost loop, elements of C are computed in succession; 
once a given element of C has been computed, subsequent 
iterations of the loop nest do not touch that location again. 
Therefore, with this loop order, it is sufficient to hold a single 
element of C in the cache, rather than the entire array. The 
same reasoning shows that it is sufficient to hold a single 
column of B in the cache. Putting these facts together, we see 
that with this loop order, there will be no capacity misses if 
the cache can hold all of A, a single column of B, and a single 
element of C. This leads to Inequality (5). 


N +Ng+1<C; (5) 


1) Correcting for non-unit line size: In reality, caches have 
non-unit line size. Assume that the line size is Bı. If the 
three tiles are stored in column major order, both B and C are 
walked by columns and A is in cache for the entire duration 
of the mini-MMM. This leads to the refined constraint shown 


in Inequality (6). 

N2 NB C1 
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2) Correcting for LRU replacement policy: We can further 
relax the restrictions of our cache organization to allow for 
Least Recently Used (LRU) replacement instead of optimal 
replacement. To determine the effects of LRU replacement 
on the optimal tile size Ng, we must examine the history of 
memory accesses performed by the loop nest. This analysis is 
in the spirit of Mattson et.al. [30], who introduced the notions 
of stack replacement and stack distance. 

We start with the innermost loop of the mini-MMM loop 
nest. A single iteration (j,i, k} of this loop touches elements 


Aik; Brg; Cij; 


where the most recently accessed element is written rightmost 
in this sequence. 

Extending this analysis to the middle loop, we see that the 
sequence of memory access for a given value of the outer loop 
indices (j,i) is the following (as before, the most recently 
accessed element is rightmost): 


Aio; Bog; Cig; An; Big; Cig; -- 5 Ai, Na 13 Bug —1,95 Cig; 
Note that the location C;j is touched repeatedly, so the 

corresponding history of memory accesses from least recently 

accessed to most recently accessed is the following: 


Aio; Boj; Ait; Bij; ...; Ai, Ng—1; Bg —1,33 Cig; 


Extending this to a single iteration j of the outermost loop, 
we see that the sequence of memory accesses is the following 
(in left-to-right, top-to-bottom order): 


Aoo; Boj; Ao, Ng-1; BNg-1,33 Coj; 
A10; Boj; A1, Ng-1; BNg—1,5; Ciz 
Anz —1,03 Boj; Ang-1,Ne-13BNe-1,53 CNe-1,93 


Note that the column of B is reused Np times, and thus the 
corresponding history of memory accesses from least recently 
accessed to most recently accessed is 


Aoo; wis Ao,Np—-1; Coj; 
A10; sed Ai,Np-13 Ciz; 
Anz -1,03 Bos; ANs-—1,Ns—1; BNs-1,j; CNp-1,j; 


We do not want to evict the oldest element of this history 
(Aoo) because, as we discussed before, A is completely reused 
in all iterations of the outermost loop. Therefore we need to 
choose Np is such a way that this whole history fits in the 
cache. 

Furthermore, after the j*” iteration of the outermost loop 
is complete, the j + 1% iteration will bring in the j + 1% 
column of B, which participates in an inner product with all 
the rows of A. Because of LRU, this new column will not be 
able to “optimally” replace the old j** column of B, since the 
old column of B has been used quite recently. For the same 
reason the new element of C, namely Co,;41, will not be able 
to optimally replace the old Co;. To account for this, we need 
extra storage for an extra column of B and an extra element 
of C. 

Putting this all together, we see that if the cache is fully- 
associative with capacity C1, line size Bı and has an LRU 
replacement policy, we need to cache all of A, two columns of 
B and a column plus an element of C. This result is expressed 
formally in Inequality (7). 


N? Ng Ci 
HDH- (7) 
Finally, to model the mini-MMM code of Figure 3, which 
includes register tiling, we need to take into account inter- 
actions between the register file and the L1 cache. Thus far, 
we implicitly assumed that the computation works directly 
on the scalar elements of the tiles. As Figure 3 shows, the 
mini-MMM loop nest actually works on register tiles. We 
refine Inequality (7) by recognizing that considerations of 
rows, columns, and elements of A, B, and C respectively must 


be replaced by considerations of horizontal panels, vertical 
panels, and register tiles instead. Taking this into account, we 
get Inequality (8). 


N2 Neg x Nu Mu Ci 
E a ha e © 

3) Correcting to avoid micro-MMM clean-up code: Note 
that estimating N pg using Inequality (7), it is possible to get a 
value for Ng which is not an exact multiple of My and Ny. 
This requires the generation of clean-up code for fractional 
register tiles at the boundaries of mini-MMM tiles. This com- 
plicates code generation, and generally lowers performance. 
We avoid these complications by trimming the value of Np 
determined from Inequality (7) so that it becomes a multiple 
of My and Ny. The ATLAS Code Generator requires Np to 
be an even integer, so we enforce this constraint as well. 

If Np is the tile size obtained by using Inequality (7), we 


set Np to the value Lemar x | x lem (Mu, Nu, 2). 

Note this requires that the value of Vg be determined after 
the values of My and Ny have been determined as described 
below. 

4) Other cache organizations: If the cache organization 
is not fully-associative, conflict misses must be taken into 
account. Although there is some work in the literature on 
modeling conflict misses [10,12], these models are not com- 
putationally intractable. Therefore, we do not model conflict 
misses, although there are some general remarks we can make. 

If A, B, and C are copied to 3N% contiguous storage 
locations, Inequality (4) can also be viewed as determining the 
largest value of Np for which there are no capacity or conflict 
misses during the execution of the mini-MMM in any cache 
organization. Although ATLAS usually copies tiles, the code 
in Figure 3 shows that the three copied tiles are not necessarily 
adjacent in memory. However, if the set-associativity of the L1 
data cache is at least 3, there will be no conflict misses. 

Inequality (5) determines the largest Ng for which there are 
no capacity misses during the execution of the mini-MMM, 
although there may be conflict misses if the cache is direct- 
mapped or set-associative. Notice that these conflict misses 
arise even if data from all three matrix tiles is copied into 
contiguous memory, because the amount of the data touched 
by the program is more than the capacity of the cache, and 
some elements will map to the same cache set. 


C. Estimating My and Nu 


One can look at the register file as a software-controlled, 
fully-associative cache with unit line size and capacity equal 
to the number of available registers Np. Therefore we can use 
a variant of Inequality (5), to estimate the optimal register file 
tile size value. 

The ATLAS Code Generator uses the KIJ loop order to tile 
for the register file, and thus we need to cache the complete 
My x Ny tile of C, an 1 x Ny row of B and a single element 
of A. Therefore the analog of Inequality (5) for registers is 
Inequality (9), shown below. 


My x Nu +Nu+1< Ne (9) 


Because the register file is software controlled, the ATLAS 
Code Generator is free to allocate registers differently than 
Inequality (9) prescribes. In fact, as discussed in Section II, 
it allocates to registers a My x 1 column of A, rather than a 
single element of A. Furthermore, it needs L, registers to store 
temporary values of multiplication operations to schedule for 
optimal use of the floating point pipelines. Taking into account 
these details, we refine Inequality (9) to obtain Inequality (10). 


My x Nu + Ny + Mu + Ls < NR (10) 


Np is a hardware parameter, which is measured by the 
micro-benchmarks. The value of the optimization parameter 
L, is estimated as discussed in Section IV-E. Therefore the 
only unknowns in Inequality (10) are My and Ny. We 
estimate their values using the following procedure. 

e Let My = Ny = u. Solve Inequality (10) for u. 

e Let Mu = max (u, 1). Solve Inequality (10) for Nu. 

e Let Ny = max (Nu, 1) 

e Let (Mu, Nu) = (max (Mu, Nu) ,min (Mu, Nu)). 


D. Estimating Ky 


Although Ky is structurally similar to My and Ny, it is 
obviously not limited by the size of the register file. Therefore 
the only practical limit for Ky is imposed by the size of the 
instruction cache. To avoid micro-MMM clean-up code, we 
trim Ky so that Npg is a multiple of Ky. Note that if Ky = 
Ng it is left unchanged by this update. 

Therefore our model for estimating Ky is to unroll the 
loop as far as possible within the size constraints of the 
L1 instruction cache, while ensuring that Ky divides Np. 
On most platforms, we found that the loop can be unrolled 
completely (Ky = Np). 


E. Estimating L, 


L, is the optimization parameter that represents the skew 
factor the ATLAS Code Generator uses when scheduling 
dependent multiplication and addition operations for the CPU 
pipeline. 

Studying the description of the scheduling in Section II, 
we see that the schedule effectively executes L, independent 
multiplications and Ls — 1 independent additions between a 
multiplication mul; and the corresponding addition add;. The 
hope is that these 2 x L, — 1 independent instructions will hide 
the latency of the multiplication. If the floating-point units are 
fully pipelined and the latency of multiplication is Lx, we get 
the following inequality, which can be solved to obtain a value 
for Ls. 


2x Ls—1> Ly (11) 


On some machines, there are multiple floating-point units. 
If |ALUpp| is the number of floating-point ALUs, Inequal- 
ity (11) gets refined as follows. 

2xL,—1 
——— >L 
|ALUFp| Ss 


Solving Inequality (12) for Ls, we obtain Inequality (13). 


(12) 


(13) 
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Of the machines in our study, only the Intel Pentium 
machines have floating-point units that are not fully pipelined; 
in particular, multiplications can be issued only once every 
2 cycles. Nevertheless, this does not introduce any error 
in our model because ATLAS does not schedule back-to- 
back multiply instructions, but intermixes them with additions. 
Therefore, Inequality (11) holds. 


F. Estimating other parameters 


Our experience shows that performance is insensitive to the 
values of Fp, Ip, and Np optimization parameters. Therefore 
we set Fp = 1(true), Ip = 2 and Np = 2. 

FMA is a hardware parameter, independent of the specific 
application. If our micro-benchmarks determine that the archi- 
tecture supports a fused multiply-add instruction, we set this 
parameter appropriately. 

Finally, we set NCNg = Np. That is, we use the same 
tile size for the non-copying version of mini-MMM as we 
do for the copying version. In our experiments, ATLAS 
always decided to use the copying version of mini-MMM7?, 
so the value of this parameter was moot. A careful model 
for NC Np is difficult because it is hard to model conflict 
misses analytically. There is some work on this in the compiler 
literature but most of the models are based on counting 
integer points within certain parameterized polyhedra and 
appear to be intractable [10, 12]. Fraguela et. al. have proposed 
another approach to modeling conflict misses when the sizes 
of matrices are known [16]. In some compilers, this problem 
is dealt with heuristically by using the effective cache capacity, 
defined to be a fraction (such as 2) of the actual cache capacity, 
when computing the optimal tile size. In our context, we 
could set NC Npe to the value determined from Inequality (7) 
with Cı replaced with G, We recommend this approach 
should it become necessary to use a smaller tile size on some 
architectures. 


G. Discussion 


We have described a fairly elaborate sequence of models 
for estimating the optimal value of Nz. In practice, the value 
found by using Inequality (6), a relatively simple model, is 
close to the value found by using more elaborate models such 
as Inequalities (7) and (8). 


V. EXPERIMENTAL RESULTS 


Models are to be used, not believed. 
H. Theil ‘Principles of Econometrics’ 


In this section, we present the results of running ATLAS 
CGw/s and ATLAS Model on ten common platforms. For all 
experiments we used the latest stable version of ATLAS, which 
as of this writing is 3.6.0. Where appropriate, we also present 


?Using the non-copy version is mainly beneficial when the matrices 
involved in the computation are either very small or are long and skinny [37]. 


numbers for ATLAS Unleashed and vendor supported, native 
BLAS. 
We did our experiments on the following platforms. 


e RISC, Out-of-order 


— DEC Alpha 21264 
— IBM Power 3 

— IBM Power 4 

— SGI R12K 


e RISC In-order 


— Sun UltraSPARC Ili 
— Intel Itanium2 
e CISC, Out-of-order 

— AMD Opteron 240 

- AMD Athlon MP 

— Intel Pentium II 

— Intel Pentium 4 

For each platform, we present the following results. 
e Times: 

— X-Ray: time taken by X-Ray to determine hardware 
parameters. 

— ATLAS Micro-benchmarks: time taken by the micro- 
benchmarks in ATLAS to determine hardware para- 
meters. 

— ATLAS Optimization Parameter Search: time taken 
by global search in ATLAS for determining opti- 
mization parameter values. 


We do not report the actual installation time of any of the 
versions of ATLAS because most of this time is spent in 
optimizing other BLAS kernels, generating library code, 
building object modules, etc. 

We do not discuss the timing results in detail as they are 
not particularly surprising. X-Ray is faster than ATLAS 
in measuring hardware parameters on nine out of the ten 
platforms, and has comparable timing (10% slower) on 
one (IBM Power 3). Moreover, while ATLAS CGw/S 
spends considerable amount of time, ranging between 8 
minutes on the DEC Alpha to more than 8 hours on the 
Intel Itanium 2, to find optimal values for optimization 
parameters, the model-based approach takes no measur- 
able time. 

e Performance: 


— Optimization parameter values: values determined 
by ATLAS CGw/S and ATLAS Model. Where ap- 
propriate, we also report these values for ATLAS 
Unleashed. 

— mini-MMM performance: performance of mini- 
MMM code produced by ATLAS CGw/S, ATLAS 
Model and ATLAS Unleashed. 

— MMM performance: for matrices sized 100 x 100 to 
5000 x 5000. We report performance of complete 
MMM computations using (i) vendor supported, na- 
tive BLAS, and the code produced by (ii) ATLAS 
CGw/S, (iii) ATLAS Model, (iv) ATLAS Unleashed, 
and (v) the native Fortran compiler. On each plat- 
form, the code produced by ATLAS is compiled with 
the best C compiler we could find on that platform. 


The input to the FORTRAN compiler is the standard 
triply-nested loop shown in Figure 2. 

For vendor supported, native BLAS (labeled “BLAS” 
on all figures) we used to following libraries and 
corresponding versions, which were current at the 
time of our experiments: 


DEC Alpha: CXML 5.2 

IBM Power 3/4: ESSL 3.3 

SGI R12K: SCSL 6.5 

SUN UltraSPARC Illi: Sun One Studio 8 
Intel Itanium 2, Pentium III/4: MKL 6.1 
* AMD Opteron, Athlon: ACML 2.0 


Sensitivity Analysis: this describes the relative change 
of performance as we change one of the optimization 
parameters, keeping all other parameters fixed to the 
values found by ATLAS CGw/S. Sensitivity analysis 
explains how variations in the values of optimization 
parameters influence the performance of the generated 
mini-MMM kernel. 
— Np: change in mini-MMM performance when the 
value of Ng is changed 
— My, Nu: change in mini-MMM performance when 
values of My and Ny are changed. Because optimal 
values of My and Ny depend on the same hardware 
resource (Vr), we vary them together. 
— Ky: change in min-MMM performance when value 
of Ky is changed. 
— L,: change in mini-MMM performance when Ls is 
changed. 
— Fp, Ip and Np: we do not show sensitivity graphs 
for these parameters because performance is rela- 
tively insensitive to their values. 


x * *¥ * * 


A. DEC Alpha 21264 


1) mini-MMM: On this machine the model-determined 
optimization parameters provided performance of about 100 
MFLOPS (7%) slower than the ones determined by search. 
The reason of the difference is the suboptimal selection of the 
Ne parameter (84 for Atlas Model vs. 72 for ATLAS CGw/S), 
as can be seen in the Nz sensitivity graph of Figure 12(g). 

2) MMM Performance: Figure 12(d) shows the MMM 
performance. 

ATLAS Unleashed produces the fastest BLAS implemen- 
tation because it uses highly-optimized, hand-tuned BLAS 
kernels written by Goto. A newer version of these kernels is 
described in [25]. The native BLAS library is only marginally 
slower. 

Although the gap in performance of the mini- MMM codes 
produced by ATLAS CGw/S and ATLAS Model is 100 
MFLOPS, the gap in performance of complete MMM com- 
putations is only about 50 MFLOPS (4%) for large matrices. 
Finally, we note that the GNU FORTRAN compiler is unable 
to deliver acceptable performance. We did not have access to 
the Compaq FORTRAN compiler, so we did not evaluate it. 

3) Sensitivity Analysis: Figure 12(e) shows the sensitivity 
of performance to the values of My and Ny. The optimal 
value is (4,4), closely followed by (3,6), and (6,3). These 


match our expectations that optimal unroll factors are as close 
to square as possible, while dividing the tile size Ng = 72 
without reminder. 

Figure 12(f) shows the sensitivity of performance to the 
value of Np. Figure 12(g) shows a scaled-up version of this 
graph in the region of the optimal Ng value. The optimal 
value for Ng is 88. ATLAS does not find this point because 
it does not explore tile sizes greater than 80, as explained in 
Section III, but it chooses a tile size of 72, which is close to 
optimal. If we use Inequality (8) to determine Np analytically, 
we obtain Ng = 84. Note that using the simpler model of 
Inequality (6), we obtain Ng = 90, which appears to be almost 
as good as the value determined by the more complex model. 

The Np sensitivity graph of Figure 12(g) has a saw-tooth of 
periodicity 4, with notable peaks occurring with a periodicity 
of 8. The saw-tooth of periodicity 4 arises from the interaction 
between cache tiling and register tiling - the register tile is 
(4,4), so whenever Np is divisible by 4, there is no clean-up 
code for fractional register tiles in the mini-MMM code, and 
performance is good. We do not yet understand why there are 
notable peaks in the saw-tooth with a periodicity of 8. 

Figure 12(h) shows the sensitivity of performance to the 
value of Ky. On this machine the entire mini- MMM loop 
body can fit into the L1 instruction cache for values of Ky 
up to Nz. Performance is relatively insensitive to Ky as long 
as the value of this parameter is sufficiently large (Ky > 7). 

Figure 12(i) shows the sensitivity of performance to the 
value of Ls. The graph is convex upwards, with a peak at 4. 
The multiplier on this machine has a latency of 4 cycles, so 
the model for Ls in Section IV, computes Ls = 5, which is 
close to optimal. The inverted-U shape of this graph follows 
our expectations. For very small values of Ls, dependent 
multiplications and additions are not well separated and CPU 
pipeline utilization is low. As L, grows, the problem gradually 
disappears, until the performance peak is reached when the full 
latency of the multiplication is hidden. Increasing L, further 
does not improve performance as there is no more latency to 
hide. On the contrary, more temporary registers are needed to 
save multiplication results, which causes more register spills 
to memory, decreasing performance. 


B. IBM Power 3 


1) mini-MMM: On this machine, mini-MMM code pro- 
duced by ATLAS Model is about 40 MFLOPS (3%) slower 
than mini-MMM code produced by ATLAS CGw/S. Fig- 
ure 13(g) shows that one reason for this difference is the sub- 
optimal choice of Np; fixing the values of all parameter other 
than Np to the ones chosen by ATLAS CGw/S and using the 
model-predicted value of 84 for Ng results in mini-MMM 
code that performs about 100 MFLOPS worse than the mini- 
MMM code produced by ATLAS CGw/S. 

2) MMM Performance: For multiplying large matrices, the 
handwritten BLAS as well as the codes produced by ATLAS 
CGw/S, ATLAS Model, and ATLAS Unleashed perform al- 
most identically. 

3) Sensitivity Analysis: Figure 13(f) shows the sensitivity 
of performance to the value of Ng. Figure 13(g) shows a 


scaled-up version of this graph in the region of the optimal 
Np value. 

Figure 13(e) shows the sensitivity of performance to the 
values of My and Ny. 

Figure 13(h) shows the sensitivity of performance to the 
value of Ky. On this machine, the entire mini-MMM loop 
body can fit into the L1 instruction cache for values of Ky up 
to Np. Performance is relatively insensitive to Ky as long as 
the value of this parameter is sufficiently large (Ky > 5). We 
do not understand the sudden drop in performance at Ky = 3. 

Figure 13(i) shows the sensitivity of performance to the 
value of Ls. The Power 3 platform has a fused multiply-add 
instruction, which the ATLAS micro-benchmarks and X-ray 
find, and the Code Generator exploits, so performance does 
not depend on the value of Ls. 


C. IBM Power 4 


1) mini-MMM: On this machine, mini-MMM code pro- 
duced by ATLAS Model is about 70 MFLOPS (2%) slower 
than mini-MMM code produced by ATLAS CGw/S. Fig- 
ure 14(g) shows that one reason for this difference is a slightly 
sub-optimal choice of Np; fixing the values of all parameter 
other than Np to the ones chosen by ATLAS CGw/S and using 
the model-predicted value of 56 for Ng results in mini-MMM 
code that performs slightly worse than the mini- MMM code 
produced by ATLAS CGw/S. 

2) MMM Performance: Figure 14(d) shows MMM perfor- 
mance. For large matrices, the hand-tuned BLAS perform 
the best, although by a small margin. The code produced 
by ATLAS Model, ATLAS CGw/S and ATLAS Unleashed 
perform almost identically. On this machine the native IBM 
XL Fortran compiler produced relatively good results for small 
matrices. 

3) Sensitivity Analysis: Figure 14(e) shows the sensitivity 
of performance to changes in the values of My and Ny. The 
parameter values (4,4) perform best, and these are the values 
used by both ATLAS CGw/S and ATLAS Model. 

Figure 14(f) shows the sensitivity of performance to the 
value of Ng. Figure 14(g) shows a scaled-up version of this 
graph in the neighborhood of the Ng value determined by 
ATLAS CGw/S. Figure 14(f) shows that on this machine, Vp 
values between 150 and 350 give the best performance of 
roughly 3.5 GFLOPS. Using Inequality (4) for the L2 cache 
(capacity of 1.5 MB) gives Ng= 254, while Inequality (8) 
gives Np= 436, showing that on this machine, it is better to 
tile for the L2 cache rather than the L1 cache. 

Figure 14(h) shows the sensitivity of performance to the 
value of Ky. The L1 instruction cache on this machine is 
large enough that we can set Ky to Ng. As on the Power 3, 
unrolling by 3 gives poor performance for reasons we do not 
understand. 

Figure 14(i) shows the sensitivity of performance to the 
value of Ls. The Power 4 platform has a fused multiply-add 
instruction, which the ATLAS micro-benchmarks find and the 
Code Generator exploits, so performance does not depend on 
the value of Lg. 


D. SGI R1I2K 


1) mini-MMM: On this machine, mini- MMM code pro- 
duced by ATLAS Model is about 20 MFLOPS (4%) slower 
than mini-MMM code produced by ATLAS CGw/S. The 
performance of both codes is similar to that of mini-MMM 
code produced by ATLAS Unleashed. 

2) MMM Performance: Figure 15(d) shows MMM per- 
formance. The hand-coded BLAS perform best by a small 
margin. On this machine the native compiler (in this case, the 
SGI MIPSPro) generated relatively good code that was only 
20% lower in performance than the hand-coded BLAS, at least 
for small matrices. 

3) Sensitivity Analysis: Figure 15(e) shows the sensitivity 
of performance to the values of My and Ny. This machine 
has a relatively large number of registers (32), so there is a 
fairly broad performance plateau in this graph. 

Figure 15(f) shows the sensitivity of performance to the 
value of the Ng. Figure 15(g) shows a scaled-up version of 
this graph in the region of the optimal Nz value. Figure 15(f) 
shows that on this machine, Np values between 300 and 500 
give the best performance of roughly 510 MFLOPS. Using 
Inequality (4) for the L2 cache (capacity of 4MB) gives Ng= 
418, while Inequality (8) gives Ng = 718, showing that on 
this machine, it is better to tile for the L2 cache rather than 
the L1 cache. 

Figure 15(h) shows the sensitivity of performance to the 
value of the Ky. On this machine, the instruction cache is 
large enough that full unrolling (Ky=Np) is possible. 

Figure 15(i) shows the sensitivity of performance to the 
value of the Ls. The R12K processor has a fused multiply- 
add instruction, so we would expect performance of the 
generated code to be insensitive to the value of Ls. While 
this is borne out by Figure 15(i), notice that Table 15(b) 
shows that the micro-benchmark used by ATLAS did not 
discover the fused multiply-add instruction on this machine 
(FMA = 0)! It is worth mentioning that on this platform 
the FMA instruction, while present in the ISA, is not backed 
up by a real FMA pipeline in hardware. Instead it allows the 
two separate functional units (for multiplication and addition 
respectively) to be used sequentially saving one latency cycle. 
Therefore, in theory, peak performance is achievable even 
by using separate multiply and add instructions. Although 
ATLAS Code Generator schedules code using Ls = 3, the SGI 
MIPSPro compiler is clever enough to discover the separated 
multiplies and adds, and fuse them. In fact the compiler is 
able to do this even when Ls = 20, which is impressive. 


E. Sun UltraSPARC IIIi 


1) mini-MMM: On this machine, mini-MMM code pro- 
duced by ATLAS Model is about 160 MFLOPS (17%) faster 
than mini-MMM code produced by ATLAS CGw/S. The main 
reason for this is that the micro-benchmarks used by ATLAS 
incorrectly measured the capacity of the L1 data cache as 16 
KB, rather than 64 KB. Therefore ATLAS only searched for 
Ne values less than 44. Our micro-benchmarks on the other 
hand correctly measured the capacity of the L1 cache, so the 
model estimated Ng = 84, which gave better performance as 
can be seen in Figure 16(g). 


2) MMM Performance: Figure 16(d) shows the MMM 
performance. On this machine, the hand-coded BLAS and AT- 
LAS Unleashed performed roughly 50% better than the code 
produced by ATLAS CGw/S. The reason for this difference is 
that the mini- MMM code in ATLAS Unleashed (and perhaps 
the hand-coded BLAS) pre-fetches portions of the A and B 
matrices required for the next mini-MMM. This may be related 
to the Level-3 pre-fetching idea of Gustavson et. al. [3]. 

3) Sensitivity Analysis: Figure 16(e) shows the sensitivity 
of performance to the values of My and Ny. 

Figure 16(f) shows the sensitivity of performance to the 
value of the Ng. Figure 16(g) shows a scaled-up version of 
this graph in the region of the optimal Ng value. On this 
machine, as on many other machines, it is better to tile for the 
L2 cache, as can be seen in Figure 16(f). Using Inequality (4) 
for the L2 cache (capacity of 1 MB), we obtain Ng = 208, 
which gives roughly 1380 MFLOPS. Using Inequality (8), 
we obtain Ng = 356, which is close to the Ng value in 
Figure 16(f) where the performance drops rapidly. 

Figure 16(h) shows the sensitivity of performance to the 
value of the Ky. On this machine, the instruction cache is 
large enough that full unrolling (Ky=Np) is possible. 

Figure 16(i) shows the sensitivity of performance to the 
value of the Ls. This machine does not have a fused multiply- 
add instruction, so the value of the Ls parameter affects 
performance. Both the model and ATLAS CGw/S find good 
values for this parameter. 


F. Intel Itanium 2 


1) mini-MMM: On this machine, the mini-MMM code 
produced by ATLAS Model is about 2.2 GFLOPS (55%) 
slower than mini-MMM code produced by ATLAS CGw/S. 
This is a rather substantial difference in performance, so it is 
necessary to examine the sensitivity graphs to understand the 
reasons why ATLAS Model is doing so poorly. 

Figure 17(g) shows that one reason for this difference 
is that ATLAS Model used Ng = 30, whereas ATLAS 
CGw/S used Ng = 80. ATLAS CGw/S uses Ng = 80 
because it disregards the L1 data cache size (16KB) and 
considers directly the L2 cache size (256KB), and therefore 
the expression min (80, vc) in Inequality (3) evaluates to 
80, the largest possible value of Np in the search space used 
by ATLAS. 

While the value Ng = 30 used by ATLAS Model is correct 
with respect to the L1 data cache size, Intel Itanium 2 does 
not allow storing floating point numbers in the L1 data cache, 
and thus L2 has to be considered instead. Once we incorporate 
in X-Ray the ability to measure this specific hardware feature, 
the shortcoming of ATLAS Model will be resolved. 

2) MMM Performance: Figure 17(d) shows MMM perfor- 
mance. The hand-written BLAS and ATLAS Unleashed give 
the best performance. The code produced by ATLAS CGw/S 
runs about 1.5 GFlops slower than the hand-written BLAS, 
while the code produced by ATLAS Model runs about 3.5 
GFlops slower. 

3) Sensitivity Analysis: Figure 17(e) shows the sensitivity 
of performance to the values of My and Ny. The Itanium has 


128 general-purpose registers, so the optimal register tiles are 
relatively large. There is a broad plateau of (My,Nv) values 
that give excellent performance. 

Figure 17(f) shows the sensitivity of performance to the 
value of the Ng. Figure 17(g) shows a scaled-up version of 
this graph in the region of the optimal Ng value. Figure 17(f) 
shows that on this machine, the best performance is obtained 
by tiling for the L3 cache! Indeed, using Inequality (4) for the 
L3 cache (capacity of 3 MB), we obtain Ng = 360, which 
gives roughly 4.6 GFLOPS. Figure 17(f) shows that this value 
is close to optimal. Using Inequality (8), we obtain Ng = 610, 
which is close to the Ng value in Figure 17(f) where the 
performance starts to drop. 

Figure 17(h) shows the sensitivity of performance to the 
value of Ky. On the Itanium, unlike on other machines in 
our study, performance is highly sensitive to the value of 
Kv. The main reason is the large register tile (My, Nu) = 
(10, 10); after unrolling the micro-MMM loops, we get a very 
long straight-line code sequence. Furthermore, unrolling of 
the k” loop creates numerous copies of this code sequence. 
Unfortunately, the L1 instruction cache on this machine has a 
capacity of 32 KB, so it can hold only about 9 copies of the 
micro-MMM code sequence. Therefore, performance drops off 
dramatically for values of Ky greater than 9 or 10. 

Since this is the only machine in our study in which the Ky 
parameter mattered, we decided to investigate the sensitivity 
graph more carefully. Figure 6 shows a magnified version of 
Figure 17(h) in the interval Ky € [0,15]. We would expect the 
Ky sensitivity graph to exhibit the typical inverted-U shape, 
and it more or less does. However, performance for Ky = 7 
is significantly worse than the performance for Ky = 6, and 
Ky = 8, which appears anomalous. 

The anomaly arises from clean-up code that is required 
when Ky does not divide Ng evenly (see the k’ loop in 
the tiled code in Figure 3). If we unroll the k’ loop by 
Ky, the number of times the completely unrolled micro- 
MMM code is replicated inside the mini-MMM is not Ky, 
but Ky + Np%Ky (% is the reminder from integer division). 
The first term in the sum is the expected number of repetitions 
inside the unrolled k’ loop, while the second part is the clean- 
up code which takes care of the case when Ky does not divide 
Np exactly. This second piece of code is still part of the mini- 
MMM loop nest, and it has to be stored in the L1 instruction 
cache during execution to achieve optimal performance. 

For Ng = 80, performance increases initially as Ky 
increases because loop overhead is reduced. When Ky = 6, 
there are 8 copies of the unrolled micro-MMM code in the 
mini-MMM, and this is close to the I-cache limit. When 
Ku = 7, there are 7 + 80%7 = 10 copies of the micro- 
MMM code, which exceeds the I-cache limit, and performance 
drops substantially. However, when Ky = 8, there is no clean- 
up code, and there are only 8 copies of the unrolled micro- 
MMM code, so performance goes up again. Beyond this point, 
the code sizes overflows the I-cache and grows larger, and 
performance degrades gradually. Ultimately, performance is 
limited by the rate at which L1 I-cache misses can be serviced. 
For Ng = 360, the trends are similar, but the effect of clean- 
up code is less because the clean-up code performs a smaller 


fraction of the computations of the k’ loop (less than 1% 


compared to about 5% for Ng = 80). 
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Fig. 6. Intel Itanium 2: Sensitivity of performance to Ky 

Figure 17(1) shows the sensitivity of performance to the 
value of the L.. The Itanium has a fused multiply-add in- 
struction, so performance is insensitive to the L, parameter. 

In summary, the code produced by ATLAS Model on this 
machine did not perform as well as the code produced by AT- 
LAS CGw/S. However, this is because ATLAS Model tiled for 
the L1 cache, whereas on this machine, the best performance 
is obtained by tiling for L3 cache. ATLAS CGw/S gets better 
performance because the tile size is set to a larger value than 
the value used by ATLAS Model. 


G. AMD Opteron 240 


1) mini-MMM;: Table 18(c) shows that on this machine, the 
mini-MMM code generated by ATLAS Model runs roughly 
38% slower than the code generated by ATLAS CGw/S. The 
values of almost all optimization parameters determined by 
the two systems are different, so it is not obvious where the 
problem is. To get some insight, it is necessary to look at the 
sensitivity graphs. 

Figure 18(f) shows the performance sensitivity graph for 
Neg. Both 60 and 88 appear to be reasonable values, so 
the problem with ATLAS Model is not in its choice of 
Neg. Because Ky is bound to the value of Ng, the only 
remaining differences are those between My, Nuy, Ls, and 
F M A. Table 18(b) shows that ATLAS Model chose My = 2, 
Ny = 1, FMA = 0, while ATLAS CGw/S chose My = 6, 
Ny = 1, FMA = 1. We verified experimentally that if the 
model had chosen My = 6 and FMA = 1, keeping the 
rest of the parameters the same, the mini-MMM performance 
becomes 2050 MFLOPS, closing the performance gap with 
ATLAS CGw/S. 

The parameters values used by ATLAS CGw/S are puzzling 
for several reasons. First, the Opteron does not have an FMA 
instruction, so it is not clear why ATLAS CGw/S chose to set 
FMA = 1. Second, choosing 6 and 1 for the values of My 
and Ny violates Inequality (10) since the Opteron has only 8 
registers. 

We address the problem of the register-tile size first. Recall 
that Inequality (10) stems from the fact that ATLAS uses 
registers to multiply an My x 1 vector-tile of matrix A (which 
we call a) with a 1 x Ny vector-tile of matrix B (which 


we call b), accumulating the result into an My x Nuy tile 
of matrix C (which we call č). Notice that if Ny = 1, then 
b is a single scalar that is multiplied by each element of a. 
Therefore no reuse exists for elements of a. This observation 
lets us generate the code in Figure 7, which uses 1 register for 
b (rb), 6 registers for ¢ (rc, ...rcg) and 1 temporary register 
(rt) to hold elements of a. 


rei — Cı ...TC6 — Ce 


loop k 


{ : 
rb — bi 
rt — ai 


rt — rt x rb 
rea — re +rt 


rt — a2 
rt — rt x rb 
Treg — reg + rt 


rt — a6 
rt —rtxrb 
rce — ree t+ rt 


Cy EH TC x CEO TC6 


Fig. 7. (My, Nu) = (6,1) code for x86 CISC 

Even if there are enough logical registers, this kind of 
scheduling may be beneficial if the ISA is 2-address rather than 
3-address, because one of the operands is overwritten. This is 
true on the Opteron when the 16 SSE vector registers are 
used to hold scalar values, which is GCC’s default behavior. 
Even though Inequality 1 prescribes 3 x 3 register tiles, the 
refined model prescribes 14 x 1 tiles. Experiments show that 
this performs better [38]. 

One might expect that this code will not perform well 
because there are dependences between most of the instruc- 
tions that arise from the use of temporary register rt. In fact, 
experiments show that the code in Figure 7 performs well 
because of two architectural features of the Opteron. 

1) Out-of-order execution: it is possible to schedule several 
multiplications in successive cycles without waiting for 
the first one to complete. 

2) Register renaming: the single temporary register rt is 
renamed to a different physical register for each pair of 
multiply-add instructions. 

Performing instruction scheduling as described in Section II 
requires additional logical registers for temporaries, which in 
turn limits the sizes of the register tiles. When an architecture 
provides out-of-order execution and a small number of logical 
registers, it is better to use the logical registers for allocating 
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larger register tiles and leave instruction scheduling to the 
out-of-order hardware core which can use the extra physical 
registers to hold the temporaries. 

These insights permit us to refine the model described 
in Section IV as follows: for processors with out-of-order 
execution and a small number of logical registers, set Ny = 1, 
My = Nr-2, FMA=1. 

To finish this story, it is interesting to analyze how the 
ATLAS search engine settled on these parameter values. Note 
that on a processor that does not have a fused multiply-add 
instruction, FMA = 1 is equivalent to FMA = 0 and 
Ls = 1. The code produced by the ATLAS Code Generator 
is shown schematically in Figure 8. Note that this code uses 
6 registers for G (ra, ... rag), 1 register for b (rb), 6 registers 
for € (rcy...rcg) and 1 temporary register (implicitly by 
the multiply-add statement). However, the back-end compiler 
(GCC) reorganizes this code into the code pattern shown in 
Figure 7. 


rei — C1 ...TC6 — Ce 


loop k 


rai — ai 
rb — bı 


ra — re +raı x rb 
raz — a2 
raz — a3 
rea — re + raz x rb 
rez; — rc3 + rag x rb 
ras, — aa 
Tras — a5 
rea — rea + ra, x rb 
res — rcs + ras x rb 
rag — a6 


reg — rce + ras x rb 


Cı —17C1...Ce — TC6 


Fig. 8. ATLAS unroll (My, Ny) = (6,1) code for x86 CISC 

Notice that the ATLAS Code Generator itself is not aware 
that the code of Figure 7 is optimal. However, setting FM A = 
1 (even though there is no fused-multiply instruction) produces 
code that triggers the right instruction reorganization heuristics 
inside GCC, and performs well on the Opteron. This illustrates 
the well-known point that search does not need to be intelligent 
to do the right thing! Nevertheless, our refined model explains 
the observed performance data, makes intuitive sense, and can 
be easily incorporated into a compiler. 

2) MMM Performance: Figure 18(d) shows the MMM 
performance. ATLAS Unleashed is once again the fastest 
implementation here, as it uses the highly-optimized, hand- 
tuned BLAS kernels, using the SSE2 SIMD instructions, for 
which the ATLAS Code Generator does not generate code. 
The native BLAS library is about 200 MFLOPS slower on 


average. ATLAS CGw/S and ATLAS Model perform at the 
same level as their corresponding mini-MMM kernels. 

Refining the model as explained above brings ATLAS 
Model on par with ATLAS CGw/s. To bridge the gap between 
ATLAS CGw/S and user contributed code, we would need 
a different code generator — one that understands SIMD and 
prefetch instructions. GCC exposes these as intrinsic functions 
and we plan to explore this in our future work. 

3) Performance Sensitivity Analysis: Figure 18(f) shows the 
sensitivity of performance to the value of the Ng optimization 
parameter. The first drop in performance is the result of L1 
data cache misses starting to occur. This fact is accurately 
captured by our model for Nz in Inequality (8). Solving the 
inequality for C = 8192 (the L1 data cache capacity in double- 
sized floating-point values), we obtain Ng = 89. Similarly the 
second drop in performance in Figure 18(f) can be explained 
by applying the same model to the 1MB L2 cache. 

Figure 18(e) shows the performance sensitivity to the values 
of the My and Ny optimization parameters. As discussed in 
Section V-G.1, the optimal value, is (6, 1). From the graph we 
can see that the only plausible values are those with Ny = 1. 
Furthermore, performance increases while we grow My from 
1 to 6, while it suddenly drops for My = 7. This is easily 
explained by our refined model, as My + 2 < Npr would 
require 9 registers, while only 8 are available. 

Figure 18(h) shows the performance sensitivity to the value 
of the Ky optimization parameter. On this machine the entire 
mini-MMM loop body can fit into the L1 instruction cache 
for arbitrary Kuy values (up to Ky = Ng). Performance is 
relatively insensitive to Ky as long as this unroll factor is 
sufficiently large (Ky > 10). 

Figure 18(i) shows the performance sensitivity to the value 
of the L, optimization parameter. As we mentioned before, 
when FMA = 1, the Ls, optimization parameter does not in- 
fluence the generated code. Therefore, performance is constant 
with respect to Ls. 


H. AMD Athlon MP 


The AMD Athlon implements the x86 instruction set, so we 
would expect the experimental results to be similar to those 
on the Opteron. 

1) mini-MMM: Table 19(c) shows that on this machine, the 
mini-MMM code generated by ATLAS Model runs roughly 
20% slower than the code generated by ATLAS CGw/S. 
Figure 19(f) shows that the choice of Ng made by the model 
is reasonable, while Figure 19(e) shows that the register-tile 
values were not chosen optimally by the model, as on the 
Opteron. The problem and its solution are similar to those on 
the Opteron. 

2) MMM Performance: Figure 19(d) shows MMM perfor- 
mance. ATLAS Unleashed out-performs the other approaches 
by a significant margin. The hand-coded BLAS do almost as 
well, followed by ATLAS CGw/S. 

3) Sensitivity Analysis: Figure 19(e) shows the sensitivity 
of performance to the values of My and Nu. 

Figure 19(f) shows the sensitivity of performance to the 
value of Ng. Figure 19(g) shows a scaled-up version of this 


graph in the region of the optimal Ng value. Both ATLAS 
Model and ATLAS CGw/S choose good values of Ng. In 
Figure 19(g), the saw-tooth with period 2 arises from the 
overhead of executing clean-up code when the value of Np is 
odd, and therefore not divisible by the value of My(= 2). As 
on other machines, we do not understand the saw-tooth with 
period 4 that has larger spikes in performance. 

Figure 19(h) shows the sensitivity of performance to the 
value of Ky. The L1 I-cache is large enough to permit full 
unrolling (Ky = Np). However, the sensitivity graph of Ky is 
anomalous; performance is relatively low for all values of Ky 
other than Ky = Np. By examining the code produced by the 
native compiler (GCC), we found that this anomaly arose from 
interference between instruction scheduling in ATLAS and 
instruction scheduling in GCC. Notice that ATLAS CGw/S 
uses FMA = 0, so it attempts to schedule instructions and 
perform software pipelining in the mini-MMM code. Fully 
unrolling the k’ loop (Ku = Np) produces straight-line code 
which is easier for GCC to schedule. 

To verify this conjecture, we redid the Ky sensitivity study 
with FMA set to 1. Figure 9 shows the results. Setting 
FMA = 1 dissuades the ATLAS Code Generator from 
attempting to schedule code, so GCC has an easier job, 
producing a Ky sensitivity graph that is in line with what 
we would expect. 

Notice that our refined model, described in the context of 
the Opteron, does exactly on this. Using this model, mini- 
MMM performance is 1544 MFLOPS, which is faster than the 
performance of the mini-MMM produced by ATLAS CGw/S. 
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Fig. 9. AMD Athlon MP: Sensitivity of performance to Ky 
Figure 19(i) shows the sensitivity of performance to the 
value of the Ls. 


I. Pentium III 


1) mini-MMM: On this machine, mini-MMM code pro- 
duced by ATLAS Model is about 50 MFLOPS (6%) slower 
than mini-MMM code produced by ATLAS CGw/S. The 
code produced by ATLAS Unleashed performs roughly 50 
MFLOPS better than the code produced by ATLAS CGw/S. 

The difference in performance between the codes produced 
by ATLAS CGw/S and ATLAS Model arises mostly from the 
sub-optimal register tile chosen by the model, as explained in 
the context of the Opteron in Section V-G. Using (6,1) as the 
register tile raises mini-MMM performance to 916 MFLOPS. 


2) MMM Performance: Figure 20(d) shows MMM per- 
formance. The hand-coded BLAS perform at roughly 1100 
MFLOPS, whereas the codes produced by ATLAS CGw/S 
and ATLAS Unleashed perform roughly at 900 MFLOPS. 
The code produced by ATLAS Model runs roughly at 850 
MFLOPS; using the refined model improves performance to a 
point that is slightly above the performance of code produced 
by ATLAS CGw/S. 

3) Sensitivity Analysis: Figure 20(e) shows the sensitivity 
of performance to the values of My and Nu. Like all x86 
machines, the Pentium III has a limited number of logical 
registers. Our base-line model picked (2,1) for the register 
tile, whereas ATLAS CGw/S chose (4, 1). If we use the refined 
model described in Section V-G, the size of the register tile 
becomes (6,1), and mini-MMM performance rises to 916 
MFLOPS. 

Figure 20(f) shows the sensitivity of performance to the 
value of Ng. Figure 20(g) shows a scaled-up version of this 
graph in the region of the optimal Ng value. The broad peak 
in Figure 20(f) arises from the influence of the L2 cache 
(capacity of 512 KB). Using Inequality (4) for the L2 cache, 
we obtain Ng = 104, which is the Ng values where the 
peak starts, while Inequality (8) gives Ng = 164, which 
corresponds to the Ng value where the peak ends. The L2 
cache on the Pentium III is 8-way set-associative, so the drop 
in performance between Ng = 104 and Ng = 164 is small. 

Figure 20(h) shows the sensitivity of performance to the 
value of the Ky. On this machine, the L1 instruction cache is 
large enough to permit full unrolling (Ky = Nez). 

Figure 20(i) shows the sensitivity of performance to the 
value of the Ls. There is no fused multiply-add instruction, 
so performance is sensitive to the value of Ls, but both ATLAS 
Model and ATLAS CGw/S find reasonable values for this 
parameter. If we use the refined model described in Section V- 
G, we set FMA = 1, and the value of the L, parameter 
becomes irrelevant. 


J. Pentium 4 


1) mini-MMM: On this machine, mini-MMM code pro- 
duced by ATLAS Model is about 600 MFLOPS (40%) slower 
than mini-MMM code produced by ATLAS CGw/S. This is 
mostly because of the sub-optimal register tile used by ATLAS 
Model; changing it to (6, 1) improves the performance of mini- 
MMM code produced by ATLAS Model to 1445 MFLOPS, 
which is only 50 MFLOPS (3%) less than the performance of 
the mini-MMM code produced by ATLAS CGw/S. 

The mini-MMM produced by ATLAS Unleashed is roughly 
twice as fast as the mini- MMM produced by ATLAS Model 
because this code uses the SSE2 vector extensions to the x86 
instruction set. 

2) MMM Performance: Figure 21(d) shows the MMM 
performance. The hand-coded BLAS routines for this machine 
perform best, followed by the code produced by ATLAS 
Unleashed. Both the hand-coded BLAS and the code produced 
by ATLAS Unleashed use the SSE2 vector extensions, and 
this accounts for most of the gap between these codes and the 
codes produced by ATLAS Model and ATLAS CGw/S. We 


do not know why the hand-coded BLAS perform substantially 
better than the code produced by ATLAS Unleashed. 

The gap in performance between the codes produced by 
ATLAS CGw/S and ATLAS Model disappears if the refined 
model for register tiles is used. 

3) Sensitivity Analysis: Figure 21(e) shows the sensitivity 
of performance to the values of My and Ny. This figure shows 
that the best register tile is (5, 1), which produces mini- MMM 
code that runs at 1605 MFLOPS. Using (6, 1) as the register 
tile is not as good because it reduces performance to 1521 
MFLOPS. 

Figure 21(f) shows the sensitivity of performance to the 
value of the Ng. Figure 21(g) shows a scaled-up version 
of this graph in the region of the optimal Ng value. Both 
ATLAS Model and ATLAS CGw/S choose good tile sizes for 
the L1 cache. Tiling for the L2 cache gives slightly better 
performance. The L2 cache on this machine has a capacity of 
256 KB; using Inequalities (4) and (8), we get Ng = 105 and 
Np = 180, which agree well with the data. 

Figure 21(h) shows the sensitivity of performance to the 
value of Ky. On this machine, the L1 instruction cache is 
large enough to permit full unrolling (Ky = Nez). 

Figure 21(1) shows the sensitivity of performance to the 
value of Ls. 


K. Discussion 


The experimental results in this section can be summarized 
as follows. Figure 10 describes the analytical models used to 
compute values for the optimization parameters. This figure 
also shows the refined model used to compute register tile 
values for the x86 architectures. 

Figure 11 shows the relative performance of the mini- MMM 
codes produced by ATLAS Model and by ATLAS Unleashed, 
using the performance of the codes produced by ATLAS 
CGw/S as the base line (the 100% line in this figure represents 
the performance of ATLAS CGw/S on all machines). All the 
performance numbers for ATLAS Model in this graph are 
obtained by tiling for the L1 cache. 

We see that on all machines other than the Itanium, the 
codes produced by using the analytical models perform almost 
as well or slightly better than the codes produced using global 
search. On the Itanium, we saw that it is best to tile for the L3 
cache, rather than the L1 cache. By using the L2 cache instead, 
ATLAS CGw/S was able to obtain some of the benefits of 
tiling for the L3 cache. If we use this value in the model 
of Figure 10, we produce mini-MMM code of comparable 
performance. Using the actual capacity of the L3 cache gives 
even better performance. 

In our experiments we noticed that on several platforms, 
we get better MMM performance by tiling for a lower cache 
level, such as L2 or L3, rather than L1. This may result in a 
large value for Ng, which may hurt overall performance if the 
resulting MMM library routine is invoked from other routines 
such as LU and Cholesky factorizations [22]. It is unclear to 
us that this is an issue in the context of compilers, where codes 
like LU and Cholesky would be optimized directly, rather than 
built upon MMM. 
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Trim Np, to make it a multiple of My, Ny, and 2. 

Estimating Ky: 

Choose Ky as the maximum value for which mini-MMM fits in 
the L1 instruction cache. Trim Ky to make it divide Np evenly. 
Estimating Fp, Ip, and Np: 
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Fig. 10. Summary of Model 
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Fig. 11. Summary of mini-MMM Performance. Performance numbers are 
normalized to that of ATLAS CGw/S, which is presented as 100% 
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VI. CONCLUSIONS AND FUTURE WORK 


... the end of all our exploring 
Will be to arrive where we started 
And know the place for the first time. 


T.S.Eliot, Four Quartets 


The experimental results in this paper demonstrate that it is 
possible to use analytical models to determine near-optimal 
values for the optimization parameters needed in the ATLAS 
system to produce high-quality BLAS codes. The models in 
this paper were designed to be compatible with the ATLAS 
Code Generator; for example, since ATLAS uses square cache 
tiles, we had only one parameter Ng, whereas a different 
Code Generator that uses general rectangular tiles may require 
three cache tile parameters. Van de Geijn and co-workers have 
considered such models in their work on optimizing matrix 
multiplication code for multi-level memory hierarchies [20, 
21,24]. 

Our results show that using models to determine values 
for the optimization parameters is much faster than using 
empirical search. However, this does not imply that search has 
no role to play in the generation of high-performance code. 
Systems like FFTW and SPIRAL use search not to choose 
optimal values for transformation parameters, but to choose an 
optimal algorithm from a whole suite of algorithms. We do not 
know if model-driven optimization is effective in this context. 
Even in the relatively simple context of the BLAS, there are 
aspects of program behavior that may not be worth modeling 
in practice even if they can be modeled in principle. For 
example, the analytical models for Ng described in Section IV 
ignore conflict misses. Although there is some work in the 
compiler literature on modeling conflict misses [10, 12], these 
models appear to be computationally intractable. Fortunately, 
the effect of conflict misses on performance can be reduced by 
appropriate copying. If necessary, the value of Ng found by 
the model can be refined by local search in the neighborhood 
of the Ng value predicted by the model. This combination of 
modeling and local search may be the most tractable approach 
for optimizing large programs for complex high-performance 
architectures. 

At the end of this paper, we are left with the same question 
that we asked at its beginning: how do we improve the state of 
the art of compilers? Conventional wisdom holds that current 
compilers are unable to produce high-quality code because the 
analytical models they use to estimate optimization parameter 
values are overly simplistic compared to the complexity of 
modern high-performance architectures. The results in this 
paper contradict this conventional wisdom, and suggest that 
there is no intrinsic reason why compilers cannot use analytical 
models to generate excellent code, at least for the BLAS. 

However, it is important not to underestimate the challenge 
in improving general-purpose compilers to bridge the current 
performance gap with library generators. Although the tech- 
niques used by ATLAS, such as loop tiling, unrolling, and 
instruction scheduling, have been in the compiler literature for 
many years, it is not easy to incorporate them into general- 
purpose compilers. For example, transformations such as tiling 
are not always legal, so a general-purpose compiler must 


perform dependence analysis before transforming a program. 
In contrast, the implementor of a library generator focuses 
on one application and knows the precise structure of the 
code to be generated for that application, so he is not en- 
cumbered by the baggage required to support restructuring of 
general codes. At the very least, improving the state of the 
art of compilation technology will require an open compiler 
infrastructure which permits researchers to experiment easily 
with different transformations and to vary the parameters of 
those transformations. This has been a long-standing problem, 
and no adequate infrastructure exists in spite of many attempts. 

An equally important conclusion of this study is that there 
is still a significant gap in performance between the code 
generated by ATLAS CGw/S and the vendor BLAS routines. 
Although we understand some of the reasons for this gap, the 
problem of automating library generation remains open. The 
high cost of library and application tuning makes this one of 
the most important questions we face today. 
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Architecture 
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Fortran Compiler 
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GNU Fortran 3.3 
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IBM POWER 3: PLATFORM SPECIFICATION 
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IBM POWER 3: OPTIMIZATION PARAMETERS 


SC 


Machine Parameters 139s 154s 
Optimization Parameters 1984s 
[Tot EEE 
TABLE 13(c) 


IBM POWER 3: TIMINGS 


A 
AZ 
i? my RNN 
m 500 ga 


1234567 8 9 10111213141516 16 


Fig. 13(e). IBM Power 3: Sensitivity of performance to My and Ny 
MFLOPS 
1400 
1200 
1000 
800 
600 
400 
200 
NB 
20 40 60 80 100 120 
Fig. 13(g). IBM Power 3: Sensitivity of performance to Ng (zoomed) 
MFLOPS 
¢—_+_+—__4—__4—__+_+—__+ D S 
1200 
1000 
800 
600 
400 
200 
LS 
2 4 6 8 10 12 
Fig. 13(i). IBM Power 3: Sensitivity of performance to Ls 


Architecture Out-Of-Order, RISC 
CPU Core Frequency 1450 MHz 
L1 Data Cache 32 KB, 128 B/line, 2-way 
L1 Instruction Cache 64 KB, 128 B/line, 1-way 
L2 Unified Cache 1.5 MB, 128 B/line, 8-way 
L3 Cache 32 MB, 512B/line, 8-way 


Floating-Point Registers 32 
Floating-Point Functional Units 

Floating-Point Multiply Latency 
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Operating System AIX 
C Compiler XL C for AIX v.5 
Fortran Compiler XL Fortran for AIX 
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SUN ULTRASPARC III: PLATFORM SPECIFICATION 
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Architecture In-Order, EPIC, [A-64 
CPU Core Frequency 1500 MHz 
L1 Data Cache 16 KB, 64 B/line, 4-way 
L1 Instruction Cache 16 KB, 64 B/line, 4-way 
L2 Unified Cache 256 KB, 128 B/line, 8-way 
L3 Cache 3 MB, 128B/line, 12-way 


Floating-Point Registers 128 
Floating-Point Functional Units 

Floating-Point Multiply Latency 

Has Fused Multiply Add 
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INTEL ITANIUM 2: PLATFORM SPECIFICATION 
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Architecture Out-Of-Order, CISC, x86-64 
CPU Core Frequency 1400 MHz 
L1 Data Cache 64 KB, 64 B/line, 2-way 
L1 Instruction Cache 64 KB, 64 B/line, 2-way 
L2 Unified Cache 1024 MB, 64 B/line, 16-way 
Floating-Point Registers 8 x87 
ADD + MUL + Memory 


Floating-Point Functional Units 
Floating-Point Multiply Latency 4 
Has Fused Multiply Add No 
Operating System Linux 2.4.19 
C Compiler GCC C/C++ 3.3.2 


Fortran Compiler GNU Fortran 3.3.2 
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Machine Parameters 148s 101s 
tlt Parameters 556s 

| 

TABLE 18(c) 
AMD OPTERON 240: TIMINGS 


123 4 5 67 8 9 1011 12 13 1415 16 


Fig. 18(e). AMD Opteron 240: Sensitivity of performance to My and Ny 
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Fig. 18(g). AMD Opteron 240: Sensitivity of performance to Ng (zoomed) 
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Fig. 18G)._ AMD Opteron 240: Sensitivity of performance to Ls 


Architecture Out-Of-Order, CISC, x86 


CPU Core Frequency 1733 MHz 
L1 Data Cache 64 KB, 64 B/line, 2-way 
L1 Instruction Cache 64 KB, 64 B/line, 2-way 
L2 Unified Cache 256 KB, 64 B/line, 16-way 


Floating-Point Registers 8 


Floating-Point Functional Units ADD + MUL + Memory 
Floating-Point Multiply Latency 4 
Has Fused Multiply Add No 
Operating System Linux 2.4.20 
C Compiler GNU C/C++ 3.2.2 
Fortran Compiler GNU Fortran 3.2.2 


TABLE 19(a) 
AMD ATHLON MP: PLATFORM SPECIFICATION 
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Fig. 19(d). AMD Athlon MP: MMM Performance 
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Fig. 19(f). AMD Athlon MP: Sensitivity of performance to Ng 
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Fig. 19(h). AMD Athlon MP: Sensitivity of performance to Ky 
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a 4, 1, 76 0, 3, 2 1531 
Model 2, 1, 88 0, 2, 2 1239 
Unleashed 2512 


TABLE 19(b) 
AMD ATHLON MP: OPTIMIZATION PARAMETERS 


TT Search] Moa] 
Machine Parameters 220s 121s 
Cpe Parameters 3195s 

[Tot id EES 

TABLE 19(c) 
AMD ATHLON MP: TIMINGS 


123 45 67 8 9 1011 12131415 16 


Fig. 19(e). AMD Athlon MP: Sensitivity of performance to My and Ny 
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Fig. 19(g). AMD Athlon MP: Sensitivity of performance to Ng (zoomed) 
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Fig. 19G). AMD Athlon MP: Sensitivity of performance to Ls 


Architecture 

CPU Core Frequency 

L1 Data Cache 

L1 Instruction Cache 

L2 Unified Cache 
Floating-Point Registers 
Floating-Point Functional Units 
Floating-Point Multiply Latency 
Has Fused Multiply Add 
Operating System 

C Compiler 

Fortran Compiler 


Out-Of-Order, CISC, x86 
1266 MHz 

16 KB, 32 B/line, 4-way 
16 KB, 32 B/line, 4-way 
512 MB, 32 B/line, 8-way 


Linux 2.4.20-28.8smp 
GNU C/C++ 3.2 
GNU Fortran 3.2 


TABLE 20(a) 
PENTIUM III: PLATFORM SPECIFICATION 


MFLOPS 


120 
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Fig. 20(d). Pentium III: MMM Performance 
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Fig. 20(f). Pentium III: Sensitivity of performance to Ng 
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Fig. 20(h). Pentium III: Sensitivity of performance to Ky 
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TABLE 20(b) 


PENTIUM III: OPTIMIZATION PARAMETERS 


es 
Machine Parameters 133s 100s 
Optimization Parameters 630s 


[Tol id 7638 | T00S | 


TABLE 20(c) 
PENTIUM III: TIMINGS 


J, I, 44 0, 3, 2 
2, 1, 42 0, 2, 2 
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Fig. 20(e). Pentium III: Sensitivity of performance to My and Ny 
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Fig. 20(g). Pentium III: Sensitivity of performance to Ng (zoomed) 
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Fig. 20(i). Pentium III: Sensitivity of performance to Ls 


Architecture Out-Of-Order, CISC, x86 
CPU Core Frequency 2000 MHz 
L1 Data Cache 8 KB, 64 B/line, 4-way 
L1 Instruction Cache 12 K uOPs, 6 uOPs/line, 8-way 
L2 Unified Cache 512 KB, 128 B/line, 8-way 
Floating-Point Registers 


Floating-Point Functional Units 

Floating-Point Multiply Latency 

Has Fused Multiply Add 

Operating System Linux 2.4.20-30.9smp 
C Compiler GNU C v3.2.2 
Fortran Compiler GNU Fortran 3.2.2 


TABLE 21 (a) 
PENTIUM 4: PLATFORM SPECIFICATION 


MFLOPS 
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Fig. 21(d). Pentium 4: MMM Performance 
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Fig. 21(f). Pentium 4: Sensitivity of performance to Ng 
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Fig. 21(h). Pentium 4: Sensitivity of performance to Ky 


Fig. 21(e). 


Fig. 21(g). 


Fig. 


Model 


oo 3, 1, 28 0, 2, 1 1504 
1, 1, 30 0, 2, 2 913 
Unleashed 3317 
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TABLE 21(b) 
PENTIUM 4: OPTIMIZATION PARAMETERS 


oe 
Machine Parameters 136s 98s 
Optimization Parameters 643s 
| 
TABLE 21(c) 
PENTIUM 4: TIMINGS 


123 45 67 8 9 1011 12 13 1415 16 
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Pentium 4: Sensitivity of performance to My and Ny 
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Pentium 4: Sensitivity of performance to Ls 


